Digital signal processing device

A device (1) for digital signal processing, particularly video and image processing, comprising a linear
array of processing elements (PE's) (2), an external controller interface (4), input/output ports (6), and
an external memory interface (8). The input/output ports (6) allow interfacing to equipment such as
digitizing cameras and video display monitors. Single instruction multiple data (SIMD) topology is used
and incorporates two modes of communication: horizontal and vertical. Shift registers (10,12) provide
horizontal parallel communication between the PE's in the linear array. There is one PE (2) for each
digital sample, e.g. a pixel, and a sequence of data samples, e.g. a video line, can be processed in
real time. Each PE (2) includes a bit-serial arithmetic logic unit (14), a cache memory (18) and an
external memory interface register (20). All communication within a PE (2) is along a one-bit wide bus,
i.e. vertical mode, using variable source and destination addresses. The processing capability of the
device (1) is further enhanced by having a serial-parallel multiplier (16) in each PE (2). For applications
requiring more PE's, two or more devices can be cascaded together. In another embodiment dual-port
memories provide horizontal communication between the PE's in the linear array.
FIELD OF THE INVENTION

This invention relates to a digital signal processing device. More particularly, it relates to a linear array of
processors suitable for video and image processing.

BACKGROUND OF INVENTION

In video and image processing applications, real time processing is essential. Using an array of processors
achieves high processing speed, but shifts the bottleneck to inter-processor communications. Computer
architectures which reduce unnecessary chip-to-chip or processor-to-processor communication are one
method of solving the communication bottleneck and thereby increasing processing speed.

To reduce the communication bottleneck, various processing array architectures have been developed.
For example there are the "pipeline image processing" systems developed at Berkeley University and described
in an article by P.A. Ruetz and R.W. Broderson, An Image-Recognition System Using Algorithmically Dedicated
Integrated Circuits, Machine Vision Applications, Volume 1, pp. 3-22, January 1988. In the system described by
Ruetz and Broderson, a set of application specific chips are connected in a pipeline, i.e. each processor
performs some processing on its input data stream and produces an output data stream for the next processor in
the sequence. In this system, each chip contains an Algorithm Dedicated Processor array which provides the
next stage with an array of processed data corresponding to the present stage. Typically, the size of the
processor array is equal to the processing window size. The main drawback of the pipeline system approach is
the lack of flexibility due to the special purpose processors. Implementations of the pipeline architecture
include INMOS's convolution chip and the DCT chip.

To overcome the inflexibility of special purpose processors in a pipeline system, an alternative is to use
an array of general purpose processors. System architectures based on an array of general bit-programmable
processors have been proposed and implemented. One such system, based on the Active Memory Technology
(AMT) Distributed Array of Processors (DAP), is described by D.J. Hunt in AMT DAP - A Processor Array in a
Workstation Environment, Computer Systems Science and Engineering, April 1989, Vol. 4, No. 2, pp. 107-114.
Another system is Applied Intelligent System's AIS-5000 array processor.

Image processing systems built around general purpose processor arrays provide flexibility and high
execution speed. The programmable processors give the system flexibility, whereas the high execution speed
comes from distributing the computing power among a large number of small processors. However, some of these
systems are hampered by the need to access external memory devices. Typically, a processor chip contains
only 8 to 84 bit-programmable processors and very little, if any, on-chip memory. The other problem
encountered [...] to link the processors in the two-dimensional array. The interconnect occupies considerable
chip area and also reduces the amount of on-chip memory. Consequently, such an image processing system
requires numerous processor chips and external memory chips to achieve a viable processing architecture.
The resulting complexity, size and cost of these systems prevent a cost-effective solution for many video
and high-speed image processing applications.

Accordingly, it is an object of the present invention to provide a cost-effective video and image signal
processor using a single-chip device based on a linear array of programmable bit-serial processors.

Another object of the invention is to provide a true bit-serial Arithmetic Logic Unit (ALU) having an execution
speed in the same order as the ALU's of conventional bit-serial processors.

The present invention provides a linear array of bit-se[...] organized architecture. The [...]
of serial-bit ALU's and serial-parallel bit-multipliers as the number of picture elements (pixels) in one line of a
video or other image, typically 512 or 1024 pixels per digital format image, and up to 1135 pixels for standard
PAL format.

The processor elements in the array are connected in single instruction multiple data (SIMD) topology. The
SIMD topology consists of a single program controller and an array of processing elements (PE's). The controller
executes the control program and broadcasts commands to the processing elements. The processing elements,
in turn, receive and execute the broadcast commands on their own local data, hence the term single
instruction, multiple data. Each bit-serial PE includes parallel shift registers to provide fast input/output
communication to remote off-chip devices and local [...]

SUMMARY OF THE INVENTION

According to a first aspect of the present invention, there is provided a device for digital signal processing
comprising: (a) a data input port; (b) a data output port; (c) a linear array of processor elements, each processor
element having, (i) bus means for transferring data within the processor element, (ii) a plurality of logic units
coupled to said bus means; (d) said logic units for each processor element including an arithmetic logic unit
(ALU), and communication interface means; (e) said communication interface means of at least some of the
processor elements including means for receiving data from adjacent processor elements and means for
transferring data to adjacent processor elements; (f) address/data enable means coupled to said logic units of each
processor element for addressing said logic units and for controlling the transfer of data on said bus means;
and (g) means for coupling said data input port to said communication interface means of one processor
element and means for coupling said data output port to said communication interface means of another processor
element, so that all data transfer between each processor element is on said communication interface means,
and all data transfer within each processor element is on said bus means.

BRIEF DESCRIPTION OF THE DRAWINGS 

Specific implementation for the present invention will now be described, by way of example, with reference
to the accompanying drawings in which:

Fig. 1 shows a schematic representation of one embodiment comprising a single-chip video image processor;
Fig. 2 is a schematic representation of the organization of a bit-serial processor element;
Fig. 3 is a schematic representation of a bit-serial Arithmetic Logic Unit (ALU);
Fig. 4 is a schematic representation of a memory cell in the cache memory;
Fig. 5 is a schematic representation of the serial-parallel multiplier logic unit;
Fig. 6 is a schematic representation of the interface between the processor element and external memory;
Fig. 7 shows a floor-plan of an embodiment of the invention;
Fig. 8 is a schematic representation of the communication enable unit for a processor element;
Fig. 8A is a block representation of a modification of the arrangement shown in Fig. 8;
Fig. 9 is a schematic representation of an input memory cell for the dual-port memory;
Fig. 10 is a schematic representation of an output memory cell for the dual-port memory;
Fig. 11 is a schematic representation of a differential bus structure for the processor elements;
Fig. 12 shows the video image processor in the basic stand-alone or single-chip configuration; and
Fig. 13 shows the video image processor configured with external memory and a controller.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

Referring to Fig. 1, a digital signal processing device 1 embodying the present invention includes a number
of processing elements (PE) 2 arranged in a linear array; two such PE's 2 are specifically shown. The device
1 also includes an instruction/address controller 4, an input/output interface 6, and an external memory
interface 8.

Each PE 2 can contain a number of logic units. As shown in Fig. 1, each PE 2 contains a first bi-directional
shift register slice 10, a second bi-directional shift register slice 12, an arithmetic logic unit (ALU) 14, a
serial-parallel multiplier 16, a cache memory slice 18, and an external memory interface register 20.

As depicted in Fig. 1, each PE 2 represents a slice in the linear array. Similarly within the PE 2, certain
logic units are slices of functional blocks which form parts of the linear array. The first bi-directional shift
registers 10 in each PE are slices which connect in parallel to form a first bi-directional shift register block 22,
having a length equal to the linear array. Similarly, the second bi-directional shift registers 12 in each PE 2 are
slices which connect in parallel to form a second bi-directional shift register block 24. The serial-parallel
multipliers 16 in each PE 2 are also slices which connect in parallel to form a serial-parallel multiplier block 26.
Similarly, the ALU's 14 in each PE 2 form an ALU logic block 27, and the cache memory slices 18 comprise the
memory block 19. Parallel communication between adjacent PE's 2 allows large amounts of data to be moved in
relatively short periods, thereby enhancing the real-time processing capability of the device 1. In addition to this
parallel communication path between adjacent PE's 2, the device 1 also provides another form of
communication within a PE 2 as will be discussed below.

The instruction/address controller 4 receives instructions from an external micro-controller or a sequencer
28, decodes these instructions, and then applies them to a decoder block 33 which controls the various logic
units in each PE 2. The controller 4 includes a controller interface 30 to connect with the external
micro-controller or the sequencer 28.

In another aspect of the present invention, the interface 30 includes a 2.5 K by 32 bit instruction RAM
(Random Access Memory) 32. The instruction RAM 32 provides an instruction buffer between the device 1 and
the micro-controller 28 which allows the micro-controller 28 to dump instructions into the RAM 32 for execution
by the on-chip controller 4. Alternatively, the instruction RAM 32 can provide an execution buffer for instructions
which are downloaded from a bootstrap ROM (not shown) during the power-up sequence. Furthermore, the
instruction RAM 32 can be replaced by on-chip program memory, such as a programmable read only memory
(PROM) or an electrically erasable programmable read only memory (EEPROM) (not shown). Using a PROM or
EEPROM offers the advantage of a single chip implementation for suitable applications, i.e. applications in which
the control instructions do not exceed the capacity of the on-chip program memory. In addition, [...]
can be reprogrammed, the device 1 can be used for new applications.

As shown in Fig. 1, the controller 4 also includes a decoder 34 to control the first bi-directional serial shift
registers 10, a decoder 36 to control the second bi-directional serial shift registers 12, a decoder 38 to control
the ALU's 14, a decoder 40 to control the serial-parallel multipliers 16, a decoder 42 to control the cache memory
slices 18, and a decoder 44 to control the external memory registers 20.

Each particular logic unit is under the common control of the corresponding decoder. For example, if the
controller 4 inputs an instruction to multiply the pixel value by a scalar quantity, then the decoder 40 will enable
every serial-parallel multiplier 16 to execute the scalar multiplication. The common control of each logic block
by the corresponding decoder provides a means for implementing the SIMD architecture which is especially
[...] applications.

Referring still to Fig. 1, to interface with external digital signal sources, e.g. a digitizing camera, the device
1 includes a pair of first and second input/output ports 46,47 and a pair of third and fourth input/output ports
48,49. The first input/output port 46 connects to one end of the first shift register block 22, and the second port
47 connects to the other end of the register block 22. Since both the register block 22 and ports 46,47 are
bidirectional, either one of the ports 46,47 can be the input or output. As shown in Fig. 1, port 46 is the input and
port 47 is the output. Similarly for the second shift register block 24, the third input/output port 48 connects to
one end of the block 24, and the fourth port 49 connects to the other end of the block 24.

The widths of the first pair of ports 46,47 and the second pair of ports 48,49 are the same as that of the
corresponding shift register blocks 22,24 and depend on the digital signal processing application. For example,
an eight-bit pixel is common for television image applications, whereas a 12-bit data value is common in radar
applications.

Referring to Fig. 2, each PE 2 is a 1-bit-wide computing unit and has an internal 1-bit-wide or serial bus
50. The serial bus 50 connects the first and second shift register slices 10,12, the ALU 14, the serial-parallel
multiplier slice 16, the cache memory 18 and the external memory register 20, thereby allowing the various
logic units within the PE 2 to communicate with each other. Since the bus 50 is serial or 1-bit-wide, the logic
units in the PE 2 can only transfer one bit of data during the clock cycle, i.e. only one bit in each unit is
read/written from/to the bus 50 at a time. The decoder block 33, described above, controls the source and
destination addresses.

The serial bus structure together with the decoders results in a memory-bus architecture in the PE 2. Each
logic unit in the PE 2 has its own address much like a storage cell in a memory chip. To select a particular logic
unit, e.g. ALU 14, from the PE 2, the controller 4 generates the appropriate control signals for the decoder, e.g.
decoder 38, associated with the unit. The decoder then enables the selected logic unit, allowing it to read or
write a data bit from or to the serial bus 50. Using the memory-bus organization, the device 1 provides an
efficient architecture whereby one or more pixels can be dedicated to a PE 2 and, using a linear array of PE's 2, a
complete line of a video image can be shifted into the device 1 and processed [...]
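
The memory-bus organization can be pictured with a small software model. The sketch below is only an illustration under stated assumptions: the class, the function and the unit names such as "shift1.bit0" and "alu.A" are hypothetical and do not come from the specification. It shows the one idea the text describes, namely that the controller broadcasts a single MOVE (source address, destination address) and every PE executes it on its own local one-bit data.

```python
class ProcessingElement:
    """One PE: a set of one-bit storage locations sharing a one-bit bus."""

    def __init__(self, cells):
        # Hypothetical unit names mapped to one-bit values.
        self.cells = dict(cells)

    def move(self, source, destination):
        bus = self.cells[source]        # the addressed source drives the bus
        self.cells[destination] = bus   # the addressed destination latches it


def broadcast_move(array, source, destination):
    """SIMD step: the same MOVE is issued to every PE in the linear array."""
    for pe in array:
        pe.move(source, destination)


# Four PEs, each holding a different local bit in the same location.
array = [ProcessingElement({"shift1.bit0": bit, "alu.A": 0}) for bit in (1, 0, 1, 1)]
broadcast_move(array, "shift1.bit0", "alu.A")
result = [pe.cells["alu.A"] for pe in array]
print(result)
```

One instruction moves one bit in every PE at once, which is why the instruction count depends on the sample width rather than on the number of samples in the line.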

Referring still to Fig. 2, the first and second shift registers 10,12 each include a number of one-bit cells 52.
In the present embodiment, the first and second shift registers [...] 52 connects to the serial bus 50 via line 54,
thereby allowing one-bit data transfers between a cell 52 in either register 10,12 and the bus 50. As shown in
Fig. 2, each cell 52 also includes a neighbour shift register connect line 56. The neighbour lines 56 connect each
cell 52 with the corresponding cell 52 in the adjacent registers 10,12 on both sides of the cell 52. In this way, the
device 1 provides parallel communication of data, i.e. pixels, between the PE's 2 in the linear array. Furthermore,
since all the PE's 2 share the same structure and connect at their respective shift registers 10,12, they comprise
the global shift register blocks 22,24.

As will be apparent now from Fig. 1 and Fig. 2, the device incorporates two modes of data transfer:
horizontal data transfer and vertical data transfer. In horizontal data transfer mode, the device 1 uses the shift
register blocks 22,24 to move data from the input ports 46,48 "horizontally" from one PE 2 to the next PE 2
through to the output ports 47,49. Since the shift register blocks 22,24 can be as wide as the data sample, e.g. an
eight-bit pixel, the device can move a sequence of digital data samples in parallel between the PE's 2. In vertical
transfer mode, the device 1 uses the serial bus 50 to move data from the individual shift registers 10,12 contained
in the corresponding register blocks 22,24 into the various logic units in the PE 2 for processing. For example,
the device 1 can transfer the data from the first shift register 10 into the cache memory 18 and then add it to a
scalar quantity also stored in memory 18. By having the PE's 2 organized as a linear array, a whole sequence
of data samples, for example a complete line of VGA* video graphics - 720 pixels, can be shifted into the device
1 and processed by the array of PE's 2 using the vertical transfer mode.
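
The two transfer modes can be sketched in a few lines of Python. This is a simplified model under assumptions, not the circuit: `shift_right` and `vertical_load` are hypothetical helper names, each list element stands for the sample held by one PE, and the bit loop merely mimics the fact that the one-bit bus carries one bit per cycle.

```python
def shift_right(block, incoming):
    """Horizontal transfer: every PE passes its whole sample to its right neighbour.

    `incoming` enters at the left end; the sample leaving the right end is returned.
    """
    outgoing = block[-1]
    return [incoming] + block[:-1], outgoing


def vertical_load(block, width=8):
    """Vertical transfer: each PE copies its sample to local memory one bit at a time."""
    memories = []
    for sample in block:
        stored = 0
        for i in range(width):               # one bus cycle per bit
            stored |= ((sample >> i) & 1) << i
        memories.append(stored)
    return memories


line = [10, 20, 30, 40]
block = [0] * len(line)
for pixel in line:                           # one horizontal shift per sample
    block, _ = shift_right(block, pixel)
cache = vertical_load(block)
print(block, cache)
```

In this model the first sample shifted in travels furthest, so the line sits in the array in reverse order; the horizontal phase costs one shift per sample, while the vertical phase costs one bus cycle per bit regardless of the line length.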

To process a pixel, the device 1 uses the vertical data transfer mode to move data between the logic units
in the PE 2. As was discussed above, each PE 2 has a serial bus 50 which connects its logic units. Pixel data
to be processed is moved bit-by-bit along the serial bus 50 to the logic units, processed, and then either stored
locally in the cache memory 18, or moved back to the shift registers 10,12, where it can be transferred to another
PE 2 or to an external device, e.g. a video display terminal (not shown), via the output ports 47,49.

These two modes of data transfer combine with the local processing capability of each PE 2 to provide a
powerful architecture for moving and processing pixel or other digital data. Using a sampling clock or other
sequencer, the device 1 can shift in a whole series of digital data samples horizontally via the shift registers 10,12,
separately process each data sample using the linear array of PE's 2, and then sequentially shift out the
processed data samples horizontally via the shift registers 10,12. Furthermore, since the device 1 is organized as
a linear array, a new sequence of data samples can be shifted in at the same time as the previously processed
sequence of data samples is being shifted out.
For example, in a raster-scan video application, the device 1 can input horizontally a complete line of
eight-bit wide pixels using the first shift register block 22, while simultaneously outputting horizontally the
processed pixels from a previous raster-scan line using the second shift register block 24. During the line return
period of the raster-scan, the device 1 loads the second shift register block 24 with the processed pixels from the
previous raster-scan line (vertical transfer), and transfers the inputted line of pixels into the cache memory block
19 (again vertical transfer) for processing. The real-time processing capability of the device 1 is further enhanced
by the large number of bit-serial ALU's 14 and serial-parallel multipliers 16 and cache memory 18 in each PE 2 as
will be discussed below.
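
The raster-scan schedule above amounts to a two-stage line pipeline. The following sketch is one possible reading of that schedule, not a statement of the actual timing: the function name `run_line_pipeline` and the per-pixel `process` callback are hypothetical, and the processing of a cached line is assumed to overlap the next active line period.

```python
def run_line_pipeline(lines, process):
    """Return processed lines in order, modelling input, processing and output overlap."""
    cache = None        # line held in the cache memory block, being processed
    out_block = None    # processed line waiting in the second shift register block
    outputs = []
    for incoming in list(lines) + [None, None]:   # two extra periods flush the pipeline
        # Active line period: shift `incoming` in, shift `out_block` out,
        # and process the cached line, all at the same time.
        if out_block is not None:
            outputs.append(out_block)
        processed = [process(p) for p in cache] if cache is not None else None
        # Line return period: the two vertical transfers.
        out_block = processed   # results go to the second shift register block
        cache = incoming        # the new line goes to the cache memory block
    return outputs


inverted = run_line_pipeline([[1, 2], [3, 4]], lambda p: 255 - p)
print(inverted)
```

Under these assumptions each line leaves the device two line periods after it entered, while the throughput stays at one line per line period.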

Referring to Fig. 3, the ALU 14 includes three one-bit registers: the A register 58, the B register 60, and
the C register 62. The registers 58,60,62 store one-bit data read from the bus 50 and are addressed by the
[...]. The A and B registers 58,60 each have two input lines 64,66 to the bus 50. One of the lines 64 is a direct
input from the bus 50, while the other line 66 is the logical complement of the data read from the bus 50. The
combination of inverted and non-inverted data input lines 64,66 extends the Boolean function capability of the
ALU 14.

Referring still to Fig. 3, the ALU 14 includes a full adder 68, which consists of a SUM logic unit 70 and a
CARRY logic unit 72. The three registers 58,60,62 each have outputs 74,76,78 which feed into the SUM logic
70 and the carry logic 72, thereby providing a binary full adder function. The C register 62 also includes a carry
feedback input 80 which is connected to the output of the carry logic unit 72. This feedback input 80 to the C
register 62 improves the speed of binary additions and subtractions.
As shown in Fig. 3, the ALU 14 includes a Boolean logic unit 63. The logic unit implements the logical
expression: AC + BC* (where C* denotes logical complement of C). In this expression, A,B,C are the contents of
the A,B,C registers 58,60,62, and are tapped using the output lines 74,76,78, respectively. The logic unit 63
facilitates executing certain Boolean functions in the ALU 14.
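
Reading the expression of the logic unit 63 as A·C + B·C* (the form listed in Table I), a short truth-table check shows why one such unit can cover several Boolean functions. The observations below follow from Boolean algebra rather than from the specification, and the function name `boolean_unit` is hypothetical: the expression selects A when C is 1 and B when C is 0, and fixing or complementing one input yields AND, OR and XOR.

```python
from itertools import product


def boolean_unit(a, b, c):
    """Evaluate A.C + B.(not C) on one-bit inputs."""
    return (a & c) | (b & (1 - c))


for a, b, c in product((0, 1), repeat=3):
    # The unit acts as a 2-to-1 selector controlled by C.
    assert boolean_unit(a, b, c) == (a if c else b)

for x, y in product((0, 1), repeat=2):
    assert boolean_unit(x, 0, y) == (x & y)        # B = 0      gives AND
    assert boolean_unit(1, x, y) == (x | y)        # A = 1      gives OR
    assert boolean_unit(1 - x, x, y) == (x ^ y)    # A = not B  gives XOR

checked = True
```

The XOR case relies on loading one register with a complemented bit, which matches the role the text gives to the inverted input lines.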

The ALU 14 also includes a tap line 82 connecting the bus 50 to the A register 58 of the first right neighbour
PE 2 (first right neighbour tap line), and a tap line 84 connecting the bus 50 to the A register 58 of the second
neighbour to the right of the PE 2 (second right neighbour tap line). These tap lines 82,84 facilitate performing
certain digital signal processing algorithms such as 2-D image edge extraction in which smoothing is performed
by adding two horizontal neighbours followed by adding two vertical sums. As shown in Fig. 3, the ALU 14
includes two similar tap lines 86,88 for the first and second left neighbours of the PE 2 (first and second left
neighbour tap lines).

Consistent with the memory-bus architecture of the device 1, the registers 58,60,62, the SUM unit 70,
CARRY unit 72 and neighbourhood tap lines 82,84,86,88 all connect to the bus 50 as memory cells. Each unit
in the ALU 14 includes an address line 87 (shown in dotted lines). In one aspect of the invention, the address
line connects to a transmission gate 89 (shown in dotted lines) which couples the unit to the bus 50. The
controller 4 through the ALU decoder 38 uses the line 87 to address the transmission gate 89, thereby enabling
the unit to read or write data from or to the bus 50.

As is apparent in Fig. 3, the ALU 14 characterized by this invention [...]
Organized around the serial bus structure, the ALU 14 can only read or write one bit of data per clock cycle.
This is not the case for most other ALU's, which use multiplexers to simultaneously feed input registers with
different data. By utilizing the serial bus structure, the device 1 avoids [...] resulting in
significant savings in area on the semiconductor chip.

The addresses used to control the ALU 14 in the present embodiment are listed in Table I.

*VGA Video Graphics Adaptor - Trademark IBM
TABLE I

ALU ADDRESS/COMMAND    OPERATION

0                      CARRY (72) to bus (50)
1                      CARRY (72) to bus (50) and CARRY feedback (80)
2                      Bus (50) to C Register (62)
3                      Logic AC + BC* (63) to bus
4                      Bus (50) to B Register (60)
5                      Bus (50) to B Register (60) and CARRY feedback (80)
6                      Bus* (66) to B Register
7                      SUM (70) to bus (50)
8                      Bus (50) to A Register (58)
9                      Bus* (66) to A Register (58)
10                     A Register (58) from second left neighbour (88) PE to bus (50)
11                     A Register (58) from first left neighbour (86) PE to bus (50)
12                     A Register (58) to bus
13                     A Register (58) from first right neighbour (82) PE to bus (50)
14                     A Register (58) from second right neighbour (84) PE to bus (50)
                       NO operation

* denotes logical complement.

As can be seen from the instructions listed in TABLE I, the ALU 14 operates on the basis of the MOVE or
data transfer instruction. By moving the data to the appropriate destination, i.e. logic unit, the ALU 14 processes
the data. For example, to add three bits together involves executing the following MOVE instructions:
8 - Bus to A; Move first data bit to A register
4 - Bus to B; Move second data bit to B register
2 - Bus to C; Move third data bit to C register
7 - SUM to Bus; Read result by moving sum to Bus

The sum of the three bits can then be moved to the first shift register 10 or cache memory 18 by issuing the
appropriate MOVE instruction.
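
The MOVE sequence above can be replayed on a toy model of the ALU. The sketch assumes only the Table I addresses 0, 2, 4, 7 and 8; the class `BitSerialALU` and the helper `add_bit_serial` are hypothetical names. For the multi-bit addition the carry is returned to the C register with two explicit MOVEs (address 0, then address 2) rather than through the carry feedback input 80, because the exact timing of that path is not modelled here.

```python
class BitSerialALU:
    """Toy model of the one-bit ALU driven by address commands."""

    def __init__(self):
        self.a = self.b = self.c = 0

    def write(self, address, bus_bit):
        """Bus to register: 8 = A register, 4 = B register, 2 = C register."""
        targets = {8: "a", 4: "b", 2: "c"}
        setattr(self, targets[address], bus_bit & 1)

    def read(self, address):
        """Logic output to bus: 7 = SUM, 0 = CARRY."""
        a, b, c = self.a, self.b, self.c
        if address == 7:
            return a ^ b ^ c
        if address == 0:
            return (a & b) | (a & c) | (b & c)
        raise ValueError(f"address {address} is not modelled")


# Adding three bits, as in the MOVE listing: 1 + 1 + 1 = binary 11.
alu = BitSerialALU()
for address, bit in ((8, 1), (4, 1), (2, 1)):
    alu.write(address, bit)
sum_bit, carry_bit = alu.read(7), alu.read(0)


def add_bit_serial(x, y, width=8):
    """Add two unsigned numbers one bit per step, least significant bit first."""
    alu = BitSerialALU()
    total = 0
    for i in range(width):
        alu.write(8, (x >> i) & 1)
        alu.write(4, (y >> i) & 1)
        total |= alu.read(7) << i
        alu.write(2, alu.read(0))   # carry becomes the C input of the next bit
    return total


print(sum_bit, carry_bit, add_bit_serial(100, 55))
```

The result wraps modulo 2**width, as a real fixed-width bit-serial addition would unless the final carry is also stored.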

As illustrated in Fig. 2, the cache memory slice 18 connects to the serial bus 50. Each cache memory slice
18 includes a number of single-bit data cells 90. These cells 90 all individually connect to the serial bus 50,
thereby allowing the transfer of data between the bus 50 and the individual memory cell 90. By using the decoder
42 to address a particular memory cell 90, its data is written to the bus 50 and can then be read or inputted by
another logic unit connected to the bus 50, for example, the A register 58 of the ALU 14. In the present
embodiment the cache memory slice 18 for a PE 2 is 256 bits. Accordingly, for an eight-bit pixel, 32 pixel values
can be stored in the cache memory 18 alone.
In Fig. 4, the organization of the data memory cell 90 is shown. In the present embodiment, each cell 90
comprises three field effect transistors: a storage transistor 92, a write transistor 94, and a read transistor 96.
The write transistor 94 connects between the gate of the storage transistor 92 and the bus 50. The read
transistor 96 connects between the source of the storage transistor 92 and the bus 50.

To write, or store, a bit of data to the storage transistor 92, the controller 4 executes a MOVE instruction.
The MOVE instruction involves addressing the source of the data bit, for example the SUM logic unit 70 in the
ALU 14, to output the data bit to the bus 50. At the same time, the controller 4 executes the MOVE instruction
on the memory cell 90, i.e. the destination address, which through the decoder 42 turns on the write transistor
94, thereby transferring the data bit from SUM logic unit 70 via the bus 50 to the storage transistor 92. The
data bit is stored as a charge on a parasitic capacitor 98 formed between the gate and source of the storage
transistor 92.

To read the memory cell 90, a MOVE instruction is again executed, but for a read operation the memory
cell 90 becomes the source address. For example, to transfer the data bit stored in the memory cell 90 to the
B register 60 of the ALU 14, the controller 4 executes a MOVE instruction, with the read transistor 96 comprising
the source address and the B register 60 comprising the destination address. When addressed through the
decoder 42, the read transistor 96 turns on and connects the storage transistor 92 and capacitor 98 to the bus
50, thereby writing the data bit to the bus 50. The B register 60, as the destination address, at the same time
inputs the data bit from the bus 50.

As is apparent from Fig. 4, the memory cells 90 of the cache memory 18 are dynamic random access
memory (DRAM). Compared to static [...]
less chip area and low power consumption. However, the DRAM memory cell 90 requires periodic refresh
signals to regenerate the saved data since the storage capacitor 98 loses voltage because of leakage currents.
In its present embodiment, the invention does not include additional refresh circuitry because of the high internal
speed at which data moves in the device 1, i.e. 20 MHz clock rate. In applications where data degeneration is
a factor, a refresh can be implemented in the program for the controller 4, i.e. software refresh.
The serial-parallel multiplier 16 used in the present embodiment of the device 1 is shown in Fig. 5. The
multiplier 16 is a bit-serial pipeline logic unit in which the multiplicand is multiplied by the multiplier bit-by-bit
and the resultant product is shifted out to the bus 50 bit-by-bit.
As shown in Fig. 5, the multiplier 16 consists of a number of full adders 100 connected in series or pipeline.
The full adders 100 are cascaded in sequence, corresponding to most significant bit (MSB) of multiplicand to
least significant bit (LSB) of multiplicand. Each full adder [...]
100 to the bus 50. As depicted in Fig. 5, the bus 50 interfaces to the full adders 100 through a NOR gate 102,
except for the MSB full adder 100 which uses an OR gate 104. Each logic gate 102,104 has a first input 106
and a second input 108. The first input 106 loads the multiplicand bit, with its address corresponding to the
MSB to LSB sequence. The second input 108 is the bit of multiplier [...]. In the present
implementation of the multiplier 16, the second input 108 is common to all the logic gates 102,104.

Referring still to Fig. 5, each full adder 100 also includes first and second one-bit registers 110,112. The
first one-bit register 110 provides storage for the SUM result [...]
one-bit register 112 provides storage for the CARRY result from the corresponding full adder 100. These
registers 110,112 store the results from the previous addition in the sequence. The SUM register 110 feeds into
the full adder 100 of the next stage, and the CARRY register 112 feeds back into the full adder 100 at the same
stage. A one-bit product register 111 replaces the SUM register 110 of the full adder 100 for the LSB. The
product register 111 connects to the bus 50, and using this register 111 the multiplier serially shifts out the product.
As shown in Fig. 5, each register 110,111,112 also connects [...]
instruction/address controller 4, the reset line 113 provides an initialization signal to put the three registers
110,111,112 into a known state, i.e. logic zero, prior to a multiplication operation.

In the present embodiment, there are 10 full adders 100 cascaded together, thereby allowing a 10-bit
multiplicand to be multiplied in a serial-pipeline fashion. Table II lists the instructions used to control the multiplier
16.

TABLE II

MULTIPLIER ADDRESS     OPERATION

0 to 9                 Bus (50) to input (106) for LSB to MSB, i.e. address to load 10-bit
                       multiplicand bit by bit
10                     Bus (50) to second logic gate input (108) and shift result to registers
                       (110,111,112)
11                     Bus (50) to second logic gate input (108)
12                     [...]
13                     Product register (111) to bus (50) and shift registers (110,111,112)
14                     Reset (113) multiplier (16)



Consistent with the memory-bus architecture of the device 1, the instructions in Table II comprise MOVE
instructions with source and destination addresses, which provide data transfer between the bus 50 and the
multiplier 16. For example, there are instructions for loading the multiplicand LSB to MSB, i.e. addresses 0 to 9
on input lines 106 for the multiplier 16; for serially loading the multiplier, i.e. addresses 10 and 11 for the
multiplier 16; and for serially shifting the resultant product to the bus 50, i.e. address 13 for the serial multiplier 16.
Using these instructions, the multiplier 16 can serially multiply a 10-bit multiplicand and write the resulting
product into the cache memory 18 or the shift registers 10, 12.
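By way of illustration only, the serial-pipeline multiplication described above can be modelled in software. The following Python sketch assumes a 10-stage chain in which each stage holds one multiplicand bit, a SUM bit passed to the neighbouring stage and a CARRY bit fed back into the same stage, with the product shifted out one bit per cycle, LSB first. The function name, the parameter names and the number of flush cycles are assumptions made for this sketch and do not appear in the specification.

```python
def serial_parallel_multiply(multiplicand, multiplier, n_bits=10, m_bits=10):
    """Software model of a serial-parallel multiplier (illustrative sketch).

    The multiplicand is held in parallel (one bit per stage); the multiplier
    is fed in serially, LSB first; the product leaves serially, LSB first.
    """
    # One multiplicand bit per full-adder stage (cf. addresses 0 to 9).
    a = [(multiplicand >> i) & 1 for i in range(n_bits)]
    # sum_reg[0] plays the role of the product register; sum_reg[n_bits]
    # is a constant 0 feeding the last stage.
    sum_reg = [0] * (n_bits + 1)
    carry_reg = [0] * n_bits

    product = 0
    for t in range(n_bits + m_bits):
        # Serial multiplier bit, followed by zeros to flush the pipeline.
        b = (multiplier >> t) & 1 if t < m_bits else 0
        new_sum = [0] * (n_bits + 1)
        new_carry = [0] * n_bits
        for i in range(n_bits):
            # Full adder: partial-product bit + SUM from the next-higher
            # stage + CARRY fed back from this same stage.
            total = (a[i] & b) + sum_reg[i + 1] + carry_reg[i]
            new_sum[i] = total & 1
            new_carry[i] = total >> 1
        sum_reg, carry_reg = new_sum, new_carry
        # One product bit is shifted out per cycle.
        product |= sum_reg[0] << t
    return product
```

In this model every cycle corresponds to one MOVE of a multiplier bit to the multiplier followed by one MOVE of a product bit to the bus; the actual instruction sequencing in the device may differ.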

Table III lists the remaining instructions for the device 1. As can be seen, these address/commands include
instructions for controlling the remaining logic units in the PE's 2, such as the first and second shift registers
10, 12, the cache memory 18, and the external memory interface 8.

TABLE III

(1) FIRST SHIFT REGISTER (10)

ADDRESS/COMMAND    OPERATION

0 to 7             Bus (50) to bit0 to bit7, i.e. cells (52), of first shift register (10).

8 to 15            Bit0 to bit7, cells (52), of first shift register (10) to bus (50).

16                 SHIFT first register (10) to right (horizontally).

17                 SHIFT first register (10) to left (horizontally).

18                 Standby



(2) SECOND SHIFT REGISTER (12)

ADDRESS/COMMAND    OPERATION

0 to 7             Bus (50) to bit0 to bit7, i.e. cells (52), of second shift register (12).

8 to 15            Bit0 to bit7, cells (52), of second shift register (12) to bus (50).

16                 SHIFT second register (12) to right (horizontally).

17                 SHIFT second register (12) to left (horizontally).

18                 Standby

(3) CACHE MEMORY (18)

ADDRESS/COMMAND    OPERATION

0 to 256           Bit0 to bit256, i.e. cells (90), of cache memory (18) to bus (50).

257 to 511         Bus (50) to bit0 to bit256, cell (90), of cache memory (18).

(4) EXTERNAL MEMORY INTERFACE (8)

ADDRESS/COMMAND    OPERATION

0 to 7             External memory (114) to one of the interface registers (20), i.e. 0 to 7.

9 to 15            One of the interface registers (20), i.e. 0 to 7, to external memory (114).

16                 Interface register (20) to bus (50).

17                 Bus (50) to interface register (20).

18 to 26           One of eight bits from external read only memory (114) to bus (50).

As is apparent from Tables I, II and III, the commands for controlling the device 1 all reduce to a single MOVE
instruction. By using a source address and a destination address, the MOVE instruction transfers one bit of
data from the source along the bus 50 to the destination address. By having the logic units in a PE 2 organized
as memory addresses on the serial bus 50, the data can be broadcast from one source address to many
destination addresses. Since the device 1 is a single instruction computer (SIC), i.e. the extreme case of a Reduced
Instruction Set Computer (RISC), computation occurs by having logic circuits located at the addresses on the
serial bus 50 and writing or reading to the particular address.
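The principle that computation arises purely from MOVE operations between addresses can be illustrated with a small software model. The following Python sketch is not the instruction encoding of the device: the address names "A", "B", "C", "SUM", "CARRY" and "ZERO", the class name and the helper function are hypothetical, chosen only to show a full adder sitting at readable addresses on a one-bit bus and a broadcast to several destinations.

```python
class ProcessingElement:
    """Toy model of one PE whose logic units are addresses on a 1-bit bus."""

    def __init__(self, cache_bits=64):
        self.cache = [0] * cache_bits          # integer addresses: cache cells
        self.reg = {"A": 0, "B": 0, "C": 0}    # hypothetical ALU input registers

    def read(self, src):
        a, b, c = self.reg["A"], self.reg["B"], self.reg["C"]
        if src == "ZERO":                      # hypothetical constant-0 source
            return 0
        if src == "SUM":                       # full-adder sum output
            return a ^ b ^ c
        if src == "CARRY":                     # full-adder carry output
            return (a & b) | (a & c) | (b & c)
        if isinstance(src, int):
            return self.cache[src]
        return self.reg[src]

    def move(self, src, *dsts):
        """Transfer one bit from src to one or more destinations (broadcast)."""
        bit = self.read(src)
        for dst in dsts:
            if isinstance(dst, int):
                self.cache[dst] = bit
            else:
                self.reg[dst] = bit            # only A, B, C are writable here
        return bit


def add_bit_serial(pe, x_base, y_base, out_base, width):
    """Add two width-bit numbers held LSB-first in the cache, using only MOVE."""
    pe.move("ZERO", "C")                       # clear the carry register
    for i in range(width):
        pe.move(x_base + i, "A")
        pe.move(y_base + i, "B")
        pe.move("SUM", out_base + i)
        pe.move("CARRY", "C")
    pe.move("C", out_base + width)             # final carry-out bit
```

In a SIMD arrangement the same sequence of MOVE operations would be issued to every PE at once, so that each PE adds its own pair of samples.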

In some image and video processing applications, it may be necessary to store more data than can be
saved in the on-chip cache memory 18. For such applications, the device 1 includes the external memory
interface 8 depicted in Fig. 6. The external memory interface 8 allows the device 1 to have an expanded storage
capacity, by connecting the array of PE's 2 to an external memory device 114, such as a RAM chip or other
mass storage device. The external memory device 114 could also be a read only memory (ROM) chip for an
application requiring a look-up table, for example.

As shown in Fig. 6, the external memory interface 8 includes the interface register 20 for each PE 2 and
a series of multiplexers 116. In each PE 2, the register 20 connects the bus 50 to an input 118 of the multiplexer
116, and an output 120 of the multiplexer 116 connects to an output pad 122. The output pad 122 provides a
contact point for connecting the pins or leads (not shown) on the carrier or package to the semiconductor die.
Referring still to Fig. 6, the register 20 provides a buffer function which is necessary since the internal speed
of the bus 50 can be many times the access time of the external memory device 114. In the first aspect of the
present invention, the registers 20 buffer one bit of data. In another aspect of the present invention, the register
20 includes 8 one-bit cells to provide buffering for one byte of data, thereby allowing the device 1 to interface
with slow external memory devices 114.

Although multiplexers can occupy considerable chip area, in the external memory interface 8 they save
chip area because of output pad 122 and pin-out limitations. To provide a direct one-to-one interface for each
PE 2 would result in a high output pad and pin count which could make fabricating and packaging the device
1 very difficult, if not an impossibility. Consequently, by utilizing the multiplexers 116 to route the PE's 2 to the
external memory device 114, there is a substantial savings in chip area and package size, without any
significant loss in external access since register 20 provides a buffer for slower memory devices 114.

In the illustrated embodiment of the present invention, the interface 8 utilizes a number of 8-to-1
multiplexers 116 to connect the PE's 2 to the external memory 114. For a linear array of 1024 PE's 2, the interface
8 requires 128 of the multiplexers 116. A decoder 124 selects the multiplexer 116 for external memory output.
The decoder 124 generates a multiplexer select signal based on seven address inputs from the controller 4.
In the SIMD architecture of the device 1, each of the 128 multiplexers 116 selects one of the interface registers
20 connecting to its eight input lines 118, based on the external memory instructions listed in Table III. For
example, to output the data bit from the second PE in the group of eight, the controller 4 requires the
address/command 10 which enables interface register 1. In other words, the device 1 utilizes a two-tier
addressing scheme to output data to the external memory 114, i.e. addressing the multiplexer 116, then selecting the
input 118 from one of the eight interface registers 20.
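The two-tier addressing can be expressed as a short calculation. The Python sketch below assumes that each multiplexer serves eight consecutive PE's and that PE's are numbered from zero; the function and constant names are illustrative and are not taken from the specification.

```python
MUX_INPUTS = 8  # each multiplexer 116 is 8-to-1


def external_memory_select(pe_index, n_pes=1024):
    """Split a PE index into (multiplexer select, interface register select).

    Assumes consecutive PE's share a multiplexer and indices start at 0.
    """
    if not 0 <= pe_index < n_pes:
        raise ValueError("PE index out of range")
    mux_select, register_select = divmod(pe_index, MUX_INPUTS)
    return mux_select, register_select


def decoder_address_bits(n_pes=1024):
    """Number of address inputs the decoder needs to pick one multiplexer."""
    n_mux = n_pes // MUX_INPUTS
    return (n_mux - 1).bit_length()
```

For 1024 PE's this gives 128 multiplexers and seven decoder address inputs, in agreement with the figures stated above.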

Reference is next made to Fig. 7, which shows the floor plan, i.e. layout, of the device 1 on a semiconductor
die or "chip". Based on a die size of one square centimetre and using a sub-micron pitch, i.e. line-width, a linear
array of 1024 processing elements 2 can be implemented as an upper group 126 and a lower group 128. In
the layout, there are 512 PE's 2 arranged in the upper group 126 and 512 PE's 2 arranged in the lower group
128. As shown in Fig. 7, the single cache memory block 19 is split between the upper group 126 and the lower
group 128. The PE's 2 in the upper group use odd columns of the cache memory block 19 and the PE's 2 in
the lower group use even columns of the cache memory block 19. This layout is possible since the pitch of the
cache memory block 19 can be finer than the pitch of an individual PE 2. In addition, this layout organization
is an efficient utilization of the chip area.

As discussed above, each PE 2 comprises a number of logic units: the first and second bi-directional shift
registers 10,12, the ALU 14, the serial-parallel multiplier 16 and the external memory interface register 20. As
shown in Fig. 7, each of these logic units is laid out as a linear array of 512 elements corresponding to the upper
group 126 and the lower group 128 of the processing element array. In addition to the first and second shift
register blocks 22,24, the ALU logic block 27, and the serial-multiplier block 26, there is an external memory
interface block 134. Also located on the same chip is the controller interface 30, which can include the instruction
RAM 32 or PROM (not shown). Using this block structure, less interconnect is required between the various
logic units in the PE 2, which results in efficient use of the chip area.

Referring still to Fig. 7, since the linear array of PE's 2 is divided into upper and lower groups 126,128, a
first and second bus 137,139 join the respective first and second shift register blocks 22,24 to form the 1024-bit
long register blocks 22,24. As discussed above, the register blocks 22,24 provide the horizontal
communication path between all the PE's 2 in the linear array. A similar bus 130 joins the corresponding lower and upper
groups of the ALU logic block 27.

Reference is next made to Fig. 8 concerning another aspect of the present invention in which the PE 2
includes a communication enable unit 132. The communication enable unit 132 enables or disables a particular
PE 2 from communicating with adjacent PE's 2, i.e. to the right and to the left, by controlling the data path
between the cells 52 in adjacent shift register slices 10,12. Using the enable unit 132, a particular PE 2 can be
isolated from the horizontal communication path. Isolating a particular PE 2 provides the present invention with
additional functionality, as will be discussed below.

Each PE 2 includes its own communication enable unit 132 which is coupled to the shift registers 10,12.
In the following discussion, the operation of the communication enable unit 132 will be discussed with respect
to the first shift register 10, as illustrated in Fig. 8. The operation of the enable unit 132 is identical for the second
shift register 12, as will be evident to one skilled in the art. As shown in Fig. 8, the communication enable unit
132 couples to the shift register 10. The unit 132 includes a state register 133, a number of first control switches
135, and a number of second control switches 153. There are first and second control switches 135,153 for
each one of the cells 52 in the shift register 10. A control line 141 connects all of the first and second control
switches 135,153 to the state register 133, and using the line 141 the state register 133 controls the switches
135,153.

Referring still to Fig. 8, each first control switch 135 includes a data feed input 143, a data by-pass input
145 and an output 147. The data feed input 143 connects to a neighbour cell output line 149 from the cell 52
in the shift register 10, whereas the data by-pass input 145 connects to the neighbour cell input line 56. The
output 147 of the switch 135 connects to the neighbour cell input line 58 of the adjacent PE 2 to the right, as
depicted in Fig. 8. Under the control of the state register 133, the switch 135 selects one of the two inputs
143,145 for routing to the output 147 and adjacent PE 2 to the right.

As shown in Fig. 8, the second control switch 153 is a mirror image of the first control switch 135. Each
second control switch 153 includes a single input line 155, a data feed output 157, and a data by-pass output
159. The input line 155 connects to the neighbour cell output line 149 of the adjacent PE 2 to the right as shown
in Fig. 8. The data feed output 157 connects to a neighbour cell input line 161 on the cell 52 of the shift register
10, whereas the data by-pass output 159 connects to the neighbour cell output line 149 of the cell 52 of the shift
register 10. Similar to the first control switch 135, the second control switch 153, under the control of the state
register 133, i.e. using the control line 141, routes the input 155 connected to the neighbour line 149 of the
adjacent PE 2 to one of the outputs 157,159.

For example, to isolate a particular PE 2, under the control of the state register 133, the first switch 135
selects the neighbour cell input line 56 of the adjacent PE 2 to the left, i.e. data by-pass input 145, and the
second switch 153 routes the neighbour cell output line 149, of the adjacent PE 2 to the right, to the data by-pass
output 159. As shown in Fig. 8, this effectively "shorts out" the cell 52 of the shift register 10 in the particular
PE 2 in both data directions, i.e. data right and data left, thereby isolating the PE 2 from the horizontal
communication path. As is evident to one skilled in the art, the bi-directional data transfer capability of the shift
registers 10,12 requires that first and second switches 135,153 be used to isolate the cells 52 in both directions.
If uni-directional devices (to be discussed below) replace the shift registers 10,12, then only a single control
switch 135 or 153 would be required to isolate the PE 2 from the adjacent right and left PE's 2.
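The effect of the by-pass on the horizontal communication path can be illustrated with a behavioural model. The Python sketch below treats one bit position of the shift register across the array; it assumes that an isolated cell simply retains its content while data passes around it, which the specification does not state. The function and parameter names are hypothetical.

```python
def shift_right(cells, enabled, data_in=0):
    """Shift one bit position of the array one step to the right.

    cells[i]   -- bit held by the cell of PE i
    enabled[i] -- truthy if PE i takes part in horizontal communication
    Returns (new_cells, data_out); isolated PE's are skipped ("shorted out").
    """
    new_cells = list(cells)
    travelling = data_in                 # bit arriving from the left
    for i, is_enabled in enumerate(enabled):
        if is_enabled:
            # The enabled cell latches the incoming bit and passes on its old one.
            new_cells[i], travelling = travelling, cells[i]
        # An isolated cell is by-passed: the travelling bit goes straight through.
    return new_cells, travelling
```

For example, with four cells holding 1, 0, 1, 1, the second PE isolated and a 0 shifted in, the result is 0, 0, 1, 1 with a 1 shifted out: the first cell's bit reaches the third cell in a single step.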

As discussed above, the state register 133 controls the routing path of the first and second control switches
135,153. In the present embodiment of the invention, the state register 133 comprises a single-bit register, with
the data bit controlling the state of the first and second control switches 135,153. As shown in Fig. 8, a line 151
connects the state register 133 of each PE 2 to the state register 133 in the adjacent PE's 2. The connection
of the state registers 133 allows a serial control word, i.e. the string of data bits for each state register 133, to
be serially shifted into the device 1. Since the first and second control switches 135,153 for each PE 2 are
controlled by the data stored in the state register 133 for that PE 2, the enable units 132 are particularly suited to
programmed control. For example, an external micro-controller 28 using the controller interface 30 can be used
to serially shift in the control word, or in another application the control word can be stored in the bootstrap
ROM (not shown) and downloaded as part of the power-up sequence as discussed below.

In another aspect of the present invention, the state register 133 structure can be implemented using a
programmable storage device 133a such as a PROM or an EEPROM as depicted in Fig. 8A. Each state register
133 is replaced by a memory location in the programmable storage device 133a. The memory location is
associated with a particular PE 2 in the linear array and the enable state of the particular PE 2 is programmed,
or stored, by writing the enable state to the address of the corresponding memory location. This alternative
implementation of the state register 133 offers the additional advantage that the storage device can be
programmed to disable any faulty PE's at the test and packaging stage of the device 1, while still offering the
end-user the capability to program the functioning PE's according to the requirements of the digital signal processing
application for the device 1.

Using the serial control word under programmable control allows the state of the PE's, i.e. enabled or
disabled, to be changed "on the fly". This ability to quickly select particular PE's in the linear array provides
additional functionality and flexibility. By way of example, three programmable functions which the
communication enable units 132 can provide are: (1) PE testing; (2) faulty PE isolation; and (3) programmable linear
array length.

The communication enable units 132 allow each PE 2 in the linear array to be individually tested and
exercised. This is particularly important when the device 1 is first cut from the semiconductor substrate and mounted
in a test fixture. To test a particular PE in the linear array, the control word enables that PE while disabling all
the other PE's. For example, to test the fourth PE in the linear array, the control word would contain all zeros
except for a logic 1 in the fourth bit position.

Related to the testability function is the faulty PE isolation function. Once a faulty PE is detected during
the testing algorithm, the faulty PE can be isolated by setting the bit in the state register corresponding to that
particular PE. For example, if in a device having a linear array of 1,024 PE's there are found to be ten faulty
PE's, the location of these faulty PE's can be documented for the customer. As part of firmware for the external
micro-controller 28, for example, the customer disables the faulty PE's. The testing and isolating functions also
increase the number of usable devices which can be shipped to customers. For example, in applications
requiring smaller linear arrays, devices 1 which include a number of faulty PE's, but a sufficient number of functioning
PE's, can still be used.

The third function is the programmability of the linear array length. As is now apparent, the serial control
word shifted into the state registers 133 can be used not only to isolate particular PE's 2, but also to control
the length of the linear array. To control the length of the linear array, the control word simply enables the
required number of PE's 2. For example, if a device has a linear array of 1,024 PE's and only 720 PE's are
required, e.g. VGA graphics, PE's 720 to 1,024 can be disabled by writing zeros to the bits in the control word
corresponding to these PE's. Furthermore, disabling the unused PE's 2 increases the effective bandwidth of
the device 1 in the horizontal communication path. In the present embodiment of the invention, the
communication enable unit 132 was described as including one state register 133; however, by including additional state
registers (not shown) multiple configurations of the linear array can be stored locally in the device 1, thereby
facilitating "on the fly" changes to the linear array configuration for the particular application.
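The three programmable functions can be summarized by the way the serial control word is constructed. The Python sketch below assumes, following the testing example, that a logic 1 enables a PE and a logic 0 disables it, and it numbers PE's from zero, so that the fourth PE has index 3. The function name and its parameters are hypothetical and serve only to illustrate the idea.

```python
def control_word(n_pes, active_length=None, faulty=(), test_only=None):
    """Build the list of state-register bits, one per PE (1 = enabled).

    test_only     -- index of a single PE to enable for testing
    active_length -- number of leading PE's to keep in the linear array
    faulty        -- indices of PE's to isolate
    """
    if test_only is not None:
        # PE testing: all zeros except the PE under test.
        return [1 if i == test_only else 0 for i in range(n_pes)]
    length = n_pes if active_length is None else active_length
    isolated = set(faulty)
    # Array length and faulty-PE isolation combine into one word.
    return [1 if i < length and i not in isolated else 0 for i in range(n_pes)]
```

A micro-controller would shift such a word into the chained state registers 133 one bit at a time; the order in which the bits must be shifted depends on the direction of the chain and is not modelled here.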

In another aspect of the present invention, dual-port memories (not shown) can be used to replace one or
both of the shift register blocks 22,24. The dual-port memories provide similar functionality, i.e. data input or
output and data storage, as the shift register blocks 22,24. The dual-port memories also have the advantage
that a memory cell occupies less chip area than a cell in the shift register blocks 22,24. However, the dual-port
memories are limited to horizontal communication between the data input ports 46,48 and output ports 47,49,
and cannot be used for data transfer between the PE's. Accordingly, it is advantageous to include at least one
shift register block 22,24. Furthermore, at the design stage, the dual-port memory must be fixed for data input
or for data output since the dual-port memory is not bi-directional. In addition, as is apparent to one skilled in the
art, shift registers can be clocked at extremely fast rates, thereby providing a high bandwidth.

Accordingly, by using the first and second right neighbour tap lines 82,84 and the first and second left
neighbour tap lines 86,88 to provide for communication between adjacent PE's in the linear array, the dual-port
memories can replace both first and second shift register blocks 22,24.

Reference is next made to Fig. 9, which shows a data input memory cell 136 for the dual-port memory.
The data cell 136 connects to a data input bus 138 which in turn connects to one of the input-output ports 46,48
(not shown in Fig. 9). As shown in Fig. 9, the data cell 136 also connects to the serial bus 50, thereby allowing
data from an external device, e.g. a digitizing camera, to be stored in the cell and processed by the PE 2. Once
the data bit is processed, the PE 2 can store the processed data bit in cache memory 18 or transfer it to another
PE 2 using the neighbour taps 86,88,90,92 discussed above, or the PE 2 can write the processed data to
the dual-port memory.

As shown in Fig. 9, the data input cell 136 includes an input transistor 140, a data storage transistor 142,
an output transistor 144 and a parasitic capacitor 146 for storing the data bit. The capacitance between the
gate and the source of the storage transistor 142 forms the parasitic capacitor 146. The input transistor 140
connects the capacitor 146, i.e. gate of the storage transistor 142, to the data input bus 138. By activating a
data-in address line 148 on the input transistor 140, the data bit on the bus 138 is stored on the parasitic
capacitor 146. To transfer the data bit to the PE 2, via the serial bus 50, the output transistor 144 includes a transfer
address line 150 connected to its gate. By addressing the transfer line 150, the output transistor 144 is turned
on and the data bit stored on the capacitor 146 is transferred to the serial bus 50 via the output transistor 144.
According to the memory-bus organization of the PE 2, the data bit, once on the serial bus 50, can be written to
any destination address in the PE 2, for example, the ALU 14.

Reference is next made to Fig. 10, which shows a data output cell 152 for the dual-port memory. The data
cell 152 connects to a data output bus 154 which is connected to one of the output ports 48,49 (not shown).
As shown in Fig. 10, the data cell 152 also connects to the serial bus 50, thereby allowing data, which was
processed by the PE 2, to be stored in the cell 152 and then outputted on a data output bus 154 to an external
device such as a display terminal.

As shown in Fig. 10, the data output cell 152 for the dual-port memory includes an input transistor 156, a
data storage transistor 158, an output transistor 160, and a parasitic capacitor 162 for storing the data bit.
The capacitance between the gate and the source of the storage transistor 158 forms the parasitic capacitor
162. The input transistor 156 connects the parasitic capacitor 162 to the serial bus 50. By activating a data
transfer address line 164 connected to the gate of the input transistor 156, the data bit on the serial bus 50 is
stored on the parasitic capacitor 162. To transfer the data bit to the data output bus 154, the output transistor
160 includes a data out address line 166 connected to its gate. By activating the data out address line 166, the
output transistor 160 is turned on and the data bit stored on the capacitor 162 is transferred to the data output bus
154, via the output transistor 160. Once on the data output bus 154, the data bit can be written to an external
device such as a display terminal as mentioned above.

In the previous discussion, the memory-bus organization of the device 1 has been illustrated using the serial
bus 50. In another aspect of the present invention, a differential bus structure 165, as depicted in Fig. 14,
replaces the serial bus 50. Instead of a single-bit line as for the serial bus 50, the differential bus
165 has a data-bit line 167 and a complemented data-bit line 169, i.e. the logical complement of the data-bit line
167.

As shown in Fig. 11 for the cache memory 18, each cell 90 has a first tap 171 into the data-bit line 167 and
a second tap 173 into the complemented data-bit line 169. The cache memory 18 also includes additional logic
(not shown) for reading from and writing to the complemented data-bit line 169. In this aspect of the present
invention, the other logic units (not shown), such as the ALU 14 and the multiplier 16, also have corresponding
first taps into the data-bit line 167 and second taps into the complemented data-bit line 169, along with additional
logic for processing the complemented and non-complemented data from the differential bus 165.

In another aspect of the present invention (not shown), a multi-bit wide bus, e.g. a twelve-bit bus, replaces
the serial bus 50. While a twelve-bit wide bus increases the chip area occupied by each PE 2, it significantly
increases the bandwidth and processing power of each PE 2. The increased processing power and bandwidth
is suitable for applications where large amounts of data must be processed, and in applications where flexibility
for processing a variable length data sample is not a design requirement. As is evident to one skilled in the art,
for a twelve-bit wide bus, the logic units, such as the ALU 14, the multiplier 16 and cache memory 18, would
also have twelve-bit interfaces to the bus.

Reference is next made to Fig. 12, which shows the basic configuration of the stand-alone system for video
or image processing. The device 1 receives digital signals/samples, via ports 46,47, from an external
device, which can be a digitizing camera 168 or an analogue camera connected to an analogue-to-digital
converter (not shown). The inputted digital signals are processed by the device 1 and outputted via the ports 48,49
to an external device such as a high resolution monitor 172. For many video or image processing tasks, such
as median filtering, linear filtering, morphological operations, Sobel and threshold operations and discrete
cosine transforms, the on-chip cache memory 19 together with the storage capacity of the shift registers 22,24
is more than sufficient and the device 1 can be controlled by a simple control unit, such as a finite state machine,
or by instructions downloaded to the on-chip instruction RAM 32. A bootstrap ROM 170 can be used to download
the instructions to the on-chip RAM 32 as part of a power-up sequence for the device 1.

As shown in Fig. 13, the device 1 can be configured with external memory 114. In some image processing
applications, such as a motion feature encoder, storage capacity beyond that available in cache memory 19
is necessary to save the frame image. Using the external memory interface discussed above, the device 1 can
be interfaced to two high density static RAM chips 114 as shown in Fig. 13. An external controller 28, i.e. a
micro-controller, provides the control for the device 1 and memory interface 8.

It will be evident to those skilled in the art that other embodiments of the invention fall within its spirit and
scope as defined by the following claims.
Claims



1. A device for digital signal processing comprising:
(a) a data input port;
(b) a data output port;
(c) a linear array of processor elements, each processor element having
(i) bus means for transferring data within the processor element,
(ii) a plurality of logic units coupled to said bus means;
(d) said logic units for each processor element including an arithmetic logic unit (ALU), and
communication interface means;
(e) said communication interface means of at least some of the processor elements including means
for receiving data from adjacent processor elements and means for transferring data to adjacent
processor elements;
(f) address/data enable means coupled to said logic units of each processor element for addressing
said logic units and for controlling the transfer of data on said bus means; and
(g) means for coupling said data input port to said communication interface means of one processor
element and means for coupling said data output port to said communication interface means of another
processor element,
so that all data transfer between said processor elements is on said communication interface means, and
all data transfer within each processor element is on said bus means.

2. A device as claimed in claim 1, wherein said bus means of each processor element comprises differential
bus means and said logic units include means for receiving ...

3. A device as claimed in claim 1 or claim 2, wherein said bus means of each processor element comprises
bit-wide serial bus means and said logic units include means for receiving and transmitting bit-serial data
from and to said bus means.

4. A device as claimed in claim 1, wherein said logic units in each processor element include a multiplier
unit.

5. A device as claimed in claim 4, wherein said multiplier unit in each processor element comprises a
serial-to-parallel multiplier, said serial-to-parallel multiplier having a plurality of stages ...
stage being coupled to the next said stage and said last stage being coupled to said bus means.

6. A device as claimed in any preceding claim, wherein said logic units in each processor element include
a cache memory, said cache memory having a plurality of cells each coupled to said bus means and to
said address/data enable means, so that each cell in said cache memory can read and write data from
and to said bus means.

7. A device as claimed in any preceding claim, further including a second data input port, a second data output
port, and second communication interface means, said second communication interface means of at least
some of the processor elements including means for receiving data from adjacent processor elements and
means for transferring data to adjacent processor elements, and means for coupling said second data input
port to said second communication interface means of one processor element and means for coupling said
second data output port to said second communication interface means of another processor element.

8. A device as claimed in any preceding claim, wherein said logic units of each processor element include
means for interfacing to an external memory device, said means for interfacing to an external memory
having at least one register coupled to said bus means and to said address/data enable means, so that each
processor element in said linear array can receive and transmit data from and to an external memory
device.

9. A device as claimed in any preceding claim, wherein means or said means of said logic units for interfacing
to an external memory device, includes or further includes multiplexing means for selecting one from a
plurality of said processor elements to interface with an external memory device, and said multiplexing
means being connected to said address/data enable means.

10. A device as claimed in any preceding claim, wherein said communication interface means comprise
bi-directional shift registers.

11. A device as claimed in any one of claims 1 to 9, wherein said communication interface means comprise
dual-port memory.

1Z A device as claimed in any preceding claim, wheretri said ALU includes a plurality of registers each having 
. at least one output and at least one input bdth coupled to said bus means and to said address/data enable 
means, adder means for adding a plurality of data bits, said adder means having a sum unit coupled to 
said bus means and to said address/data enable means and connected to the outputs of said registers, 
said adder means also having a carry unit. coupled to said bus means and to said address/data enable 
means and connected to the outputs of said registers. 

13. A devfce as dairiried In claim 1 2, wherein said plurality of registers comprise a first reg ister, a second regi- 
ster, and a third register, and said third register indudes an input connected to the output of said carry 
unit 

14. A device as c|aamed in dairn 13, wherein said ALU indudes a Boolean logic unit having an output coupled 
to said bus rheans and being coupled to said address/data enable means and having Inputs connected 
to the outputs of said first, .second, and third registers. 

\ ' . . . . .." • . ' ' 

15. A device as dasned in claim 14, wherein said ALU indudes a pluiBlity of right neighbour tap lines coupled 
to said bus means land to said address/data enable means, and a plurality of left n^ghbour tap fines coup- 
led to said bus means and to said address/data enable means, so that a processor element can transfer 
data to. and from neighbouring proceissor elements in said linear arra 

16. A device as claimed in claim 15, wherein said ALIJ indudes a first right neighbour tiap line and a second 
right neighbour tap line, and a first left, neighbour tap line and a second left neighbour tap line, so that a 
processor element using said bus means arid said address/data enable means can transfer data to and 
from ite ftrst and second right neighbours arid. to and from its first and^cond left neighbour processor 
elements in said linear array. 

17. A device as claimed in claim 16, wherein said registers include complemented inputs coupled to said bus
means and non-complemented inputs coupled to said bus means.

18. A device as claimed in any preceding claim, wherein said device is fabricated on a semiconductor
substrate.

19. A device as claimed in claim 18, wherein said semiconductor substrate includes,

(a) a structure for said cache memory integrated into said semiconductor substrate;

(b) an upper said processor element structure integrated into said semiconductor substrate; 

(c) a lower said processor element structure integrated into said semiconductor substrate; and 

(d) instruction/address controller means including a decoder logic unit integrated into said semiconductor
substrate,

so that the device is fabricated on said semiconductor substrate.

20. A device as claimed in claim 19, wherein said cache memory structure has four sides, being generally
rectangular in shape and being integrated into a centre portion of said semiconductor substrate, and said
upper processor element structure being integrated into a portion of said semiconductor substrate adjacent
to one side of said cache memory structure, said lower processor element structure being integrated into
a portion of said semiconductor substrate adjacent to the second side of said cache memory structure,
and said instruction/controller structure being integrated into a portion of said semiconductor substrate
adjacent to the third side of said cache memory structure.

21. A device as claimed in claim 20, wherein said first and second sides are on opposing sides of said cache
memory structure, and said third side extends between said first and second sides, so that the integrated
circuit aspect ratio on said semiconductor substrate is near unity.

22. A device as claimed in claim 20 or 21, wherein said cache memory structure includes even columns, each
even column having a plurality of memory cells, and odd columns, each odd column having a plurality of
memory cells, and said upper processor element structure being coupled to said even columns of said
cache memory structure, and said lower processor element structure being coupled to said odd columns
of said cache memory structure.

23. A device as claimed in any one of claims 19 to 22, further including a bus structure coupling said upper
processor element structure to said lower processor element structure, so that said upper processor
element structure and said lower processor element structure form a continuous linear array of processor
elements.

24. A device as claimed in any preceding claim, including communication enable means associated with at
least some of said processor elements for selectively enabling and disabling such associated processor
elements from receiving data from adjacent processor elements and from transferring data to adjacent
processor elements.

25. A device as claimed in claim 24, wherein said communication enable means include state register means
for storing the enable state of said associated processor elements and switching means coupled to said
state register means and to said first and second communication interface means of said associated
processor elements for selectively enabling and disabling such associated processor elements from receiving
data from adjacent processor elements and from transferring data to adjacent processor elements.

26. A device as claimed in claim 25, wherein said state register means include means for receiving enable [...]

27. A device as claimed in claim 26, wherein said state register means include storage means for storing more
than one set of said enable state data for said associated processor elements.

28. A device as claimed in claim 27, wherein said storage means comprise a plurality of data registers.

29. A device as claimed in claim 27, wherein said storage means comprise an electrically erasable
programmable read only memory.

30. A device as claimed in any one of claims 25 to 29, wherein said switching means include a first control
switch for enabling and disabling said associated processor elements from transferring data to adjacent
processor elements, and a second control switch for enabling and disabling said associated processor
elements from receiving data from adjacent processor elements, and said first control switch and said
second control switch both being connected to said state register means and to said first and second
communication interface means of said associated processor elements.
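
The effect of the two control switches can be pictured with a small simulation of one left-to-right transfer across the linear array. This is a minimal sketch under assumptions, not the claimed circuit: `EnableState` and `shift_right` are hypothetical names, the stored flags stand in for the state register means, and an element whose transfer is blocked at either end is assumed to keep its previous value.

```python
from dataclasses import dataclass


@dataclass
class EnableState:
    """Per-element enable state (standing in for the state register means)."""
    send: bool = True      # first control switch: may transfer data to a neighbour
    receive: bool = True   # second control switch: may receive data from a neighbour


def shift_right(values, states, fill=0):
    """Perform one step of left-to-right neighbour transfer.

    Element i takes the value of element i - 1 only if element i may
    receive and element i - 1 may send; otherwise it keeps its own value
    (an assumption of this sketch). Element 0 receives `fill`.
    """
    result = list(values)
    for i in range(len(values)):
        if not states[i].receive:
            continue                      # receiving disabled: value unchanged
        if i == 0:
            result[i] = fill              # nothing to the left of the first element
        elif states[i - 1].send:
            result[i] = values[i - 1]     # both switches closed: transfer happens
    return result


if __name__ == "__main__":
    states = [EnableState() for _ in range(4)]
    states[1].send = False                # element 1 may not pass data on
    print(shift_right([1, 2, 3, 4], states))
```

Keeping separate send and receive flags lets a block of elements be isolated in one direction only, which is one plausible reason for claiming two switches rather than one.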

31. A device as claimed in any preceding claim, further including controller interface means coupled to said
address/data enable means for interfacing to an external controller.

32. A device as claimed in claim 31, wherein said controller interface means include storage means for storing
control instructions received from said external controller.

33. A device as claimed in claim 32, wherein said storage means comprise a random access memory.

34. A device as claimed in claim 32 or 33, wherein said storage means comprise an electrically erasable
programmable read only memory.